Producing charts for ‘The Changing Fortunes of the Richest Countries in Grand Slam Tennis’
By: Dr. Chris Martin
Tools / packages used: R, R Markdown,
ggplot2, tidyverse (inc. dplyr and
tidyr), plotly.
Techniques used: exploratory data analysis,
functional programming (purrr package), data visualisation,
data cleaning/reshaping/manipulation.
Chart types used: area chart, line chart, bar chart,
stacked bar chart, small multiples, heatmap, ridge chart, interactive
charts (with ggplotly).
Source data: To produce the charts, I needed data on the women’s and men’s singles entrants for each Grand Slam tournament since 1990. This came from the excellent Tennis Abstract.
This notebook produces the static data visualisations which features in my data storytelling project: The Changing Fortunes of the Richest Countries in Grand Slam Tennis. You can read the full story on my website.
A note on my data visualistion workflow
The chart produced in this notebook are ‘skeletons’ with fairly minimal styling, but all the key structural components in places. The chart are exported from this notebook as svgs. These can are then editted - adding textures, photos, annotations etc. - using graphic design software to create the final versions.
Setting up the notebook
# import packages
library(tidyverse) # for data manipulation and viz
library(knitr) # for formatting tables
# set default theme for exploratory plots
theme_set(theme_light()) # using a minimal theme to make it easier to edit
# the plots in graphic design software later on
# set default R markdown chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)Reading in the data
The data is read in from csvs produced in
data_clean.Rmd, a lot of the data preparation was done in
that notebook.
gs_first_round_gdp <- read_csv("../data/results_gdp.csv") %>%
#just look up until covid pandemic (as will have distorted things)
filter(year < 2020)
# check data looks as expected
gs_first_round_gdp %>%
head() %>%
kable()| year | tourney_name | tour | name | id | ioc | country | gdp_per_capita | iso |
|---|---|---|---|---|---|---|---|---|
| 1990 | Australian Open | atp | Jim Pugh | 101004 | USA | United States | 40436.94 | USA |
| 1990 | Australian Open | atp | Ivan Lendl | 100656 | USA | United States | 40436.94 | USA |
| 1990 | Australian Open | atp | Cyril Suk | 101327 | CZE | Czechia | 23585.18 | CZE |
| 1990 | Australian Open | atp | Tomas Carbonell | 101507 | ESP | Spain | 27543.92 | ESP |
| 1990 | Australian Open | atp | Michael Brown B395 | 101895 | AUS | Australia | 31016.42 | AUS |
| 1990 | Australian Open | atp | Karel Novacek | 101120 | CZE | Czechia | 23585.18 | CZE |
gs_entries_by_country <- read_csv("../data/gs_entries_by_country.csv") %>%
# just look up until covid pandemic (as will have distorted things)
filter(year < 2020)
# check data looks as expected
gs_entries_by_country %>%
head() %>%
kable()| year | tourney_name | country_code | country | gdp_per_capita | num_first_rd | income_decile | top_20_perc |
|---|---|---|---|---|---|---|---|
| 1990 | Australian Open | USA | United States | 40436.94 | 55 | 10 | TRUE |
| 1990 | Australian Open | AUS | Australia | 31016.42 | 36 | 9 | TRUE |
| 1990 | Australian Open | FRA | France | 33732.02 | 20 | 9 | TRUE |
| 1990 | Australian Open | GER | Germany | 36699.48 | 20 | 9 | TRUE |
| 1990 | Australian Open | SWE | Sweden | 34156.82 | 14 | 9 | TRUE |
| 1990 | Australian Open | CZE | Czechia | 23585.18 | 10 | 8 | FALSE |
How have the richest 20% of countries performed at Grand Slams (1990 - 2019)?
This section focuses on how well the richest 20% of countries performed at Grand Slams (1990 - 2019). The metric used for a country’s performance is how many players they had appearing in the first round of Grand Slam tennis tournaments.
The overall trend
The performance of the richest countries declined 1900 to 2008, and then picked up again a little.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
plot_df <- gs_entries_by_country %>%
# looked performance for two groups of countries
# top 20% richest countries and the rest
mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc)) %>%
group_by(year, top_20_perc) %>%
summarise(num_first_rd = sum(num_first_rd)) %>%
ungroup() %>%
# calculate proportions from counts
group_by(year) %>%
mutate(perc_first_round = num_first_rd / sum(num_first_rd)) %>%
ungroup()
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df) +
# core chart
geom_area(aes(year, perc_first_round, fill = top_20_perc)) +
# format axis
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
# tidy up presentation
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25))
p # ----------------------------------------------------------------------------
# Export the plot for editing
# ----------------------------------------------------------------------------
ggsave("../images/all_gs.svg", units = "mm", width = 600, height = 325)Is the overall trend driven by the composition of 20% richest countries changing over time?
top_twenty_countries <- gs_entries_by_country %>%
filter(top_20_perc) %>%
count(year, country, wt = num_first_rd) %>%
arrange(year, desc(n)) %>%
group_by(country) %>%
mutate(country_ave_n = mean(n)) %>%
ungroup()
country_ave_ns <- top_twenty_countries %>%
distinct(country, country_ave_n)
# look at how many time the countries appear in the top twenty percent
top_twenty_counts <- top_twenty_countries %>%
count(country) %>%
left_join(country_ave_ns) %>%
arrange(desc(n), desc(country_ave_n)) %>%
rename(years_top_20 = n)
# for ordering exploratory plot
levels <- top_twenty_counts$country
# ggplot(top_twenty_countries) +
# geom_tile(aes(x = year,
# y = factor(country, levels = rev(levels)),
# fill = n),
# colour = "#F8F7F7") +
# labs(x= NULL, y = NULL) +
#
# coord_equal() +
#
# scale_fill_gradient(low = "#D6DCE0", high = "#000DA8", trans = "log") +
# theme_minimal() +
# theme(legend.position = "none")
# create a grid to see where countries have moved out of top 20 perc
grid <- expand_grid(year = unique(top_twenty_countries$year),
country = unique(top_twenty_countries$country))
plot_df <- grid %>%
left_join(top_twenty_countries)
# deciles by year
bottom_80_countries_by_year <- gs_entries_by_country %>%
distinct(year, country, income_decile) %>%
filter(income_decile < 9)
check_bottom_80 <- function(year, country){
selector <- bottom_80_countries_by_year$year == year &
bottom_80_countries_by_year$country == country
res <- bottom_80_countries_by_year[selector, ]
if(nrow(res) == 0){
return(FALSE)
}
else {
return(res[[1,"income_decile"]] < 9)
}
}
check_bottom_80(1990, "Nigeria")## [1] TRUE
plot_df_1 <- plot_df %>%
mutate(in_bottom_80 = map2_lgl(year, country, ~check_bottom_80(.x,.y)),
n = if_else(is.na(n) & in_bottom_80,
-1, n),
n = replace_na(n, 0),
bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 10, 50, 100, Inf)))
p <- ggplot(plot_df_1) +
geom_tile(aes(x = year,
y = factor(country, levels = rev(levels)),
fill = bin_n),
colour = "#F8F7F7") +
labs(x= NULL, y = NULL) +
coord_equal() +
guides(fill = guide_legend(reverse=TRUE)) +
# https://gka.github.io/palettes/#/5|s|ffffff,35469d|ffffe0,ff005e,93003a|1|1
scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', '#9f9dce', '#6e70b6', '#35469d')) +
#scale_fill_gradient(low = "#D6DCE0", high = "#000DA8") +
scale_x_continuous(position = "top") +
theme_minimal()
pggsave("../images/image_4.svg")
# a quick test
gs_entries_by_country %>%
filter(country == "Israel") %>%
distinct(year, income_decile)## # A tibble: 29 × 2
## year income_decile
## <dbl> <dbl>
## 1 1990 8
## 2 1991 8
## 3 1992 8
## 4 1993 8
## 5 1994 8
## 6 1995 8
## 7 1996 8
## 8 1997 8
## 9 1998 8
## 10 1999 8
## # … with 19 more rows
downward trend
gs_entries_by_country_clean <- gs_entries_by_country %>%
mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc))
all_gs_entries_by_country <- gs_entries_by_country_clean %>%
group_by(year, country) %>%
summarise(num_first_rd_year = sum(num_first_rd)) %>%
ungroup() %>%
left_join(distinct(gs_entries_by_country_clean, country, year, top_20_perc)) %>%
# select(-c(tourney_name, num_first_rd)) %>%
# distinct() %>%
group_by(year) %>%
mutate(perc_first_round = num_first_rd_year / sum(num_first_rd_year)) %>%
ungroup()
all_gs_entries_by_country %>%
group_by(year, top_20_perc) %>%
summarise(perc_first_round = sum(perc_first_round))## # A tibble: 60 × 3
## # Groups: year [30]
## year top_20_perc perc_first_round
## <dbl> <lgl> <dbl>
## 1 1990 FALSE 0.218
## 2 1990 TRUE 0.782
## 3 1991 FALSE 0.222
## 4 1991 TRUE 0.778
## 5 1992 FALSE 0.228
## 6 1992 TRUE 0.772
## 7 1993 FALSE 0.240
## 8 1993 TRUE 0.760
## 9 1994 FALSE 0.273
## 10 1994 TRUE 0.727
## # … with 50 more rows
gs_entries_top_20_perc <- all_gs_entries_by_country %>%
filter(top_20_perc)
grid <- expand_grid(year = unique(gs_entries_top_20_perc$year),
country = unique(gs_entries_top_20_perc$country))
plot_df <- grid %>%
left_join(gs_entries_top_20_perc) %>%
mutate(perc_first_round = replace_na(perc_first_round, 0)) %>%
filter(year <= 2008) %>%
mutate(is_usa = country == "United States")
change_df <- plot_df %>%
filter(year == max(year) | year == min(year)) %>%
select(year, country, perc_first_round) %>%
pivot_wider(names_from = year,
values_from = perc_first_round,
values_fill = 0) %>%
mutate(change = `2008` - `1990`,
fall = change < 0,
change_bin = cut(change, breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>%
arrange(change)
plot_df_1 <- plot_df %>%
left_join(change_df) %>%
group_by(country) %>%
mutate(ave_num_first_round = mean(num_first_rd_year, na.rm = TRUE)) %>%
ungroup() %>%
# mutate(country = fct_reorder(country, ave_num_first_round, change,
# .fun = "median"))
arrange(fall, desc(change)) %>%
mutate(country = factor(country, levels = unique(country)),
)
#visdat::vis_miss(plot_df, warn_large_data = FALSE)
p <- ggplot(plot_df_1) +
geom_area(aes(year, perc_first_round,
group = country,
fill = change_bin),
colour = "grey80", size = 0.2) +
#scale_fill_gradient2(low = "#C94A54", mid = "white", high = "#35469D") #+
# reds from https://gka.github.io/palettes/#/8|s|c94a54,fffff0|ffffe0,ff005e,93003a|1|1
# blue from https://gka.github.io/palettes/#/8|s|35469d,fffff0|ffffe0,ff005e,93003a|1|1
scale_fill_manual(values = c("#c94a54", "#efb3aa", "#fffff0", "#aeabcd")) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
breaks = c(seq(0,0.8,0.1)),
expand = expansion(mult = c(0, .1))) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2008)) +
# coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25))
#facet_wrap(~is_usa)
pggsave("../images/image_5.svg", units = "mm", width = 525, height = 350)
plotly::ggplotly(p)levels(plot_df_1$change_bin)## [1] "(-Inf,-0.1]" "(-0.1,-0.02]" "(-0.02,0.02]" "(0.02,0.1]" "(0.1, Inf]"
# get data points for plot annotation
plot_df_1 %>%
filter(country %in% c("United States", "Australia", "Sweden")) %>%
group_by(year) %>%
summarise(perc_first_round_tot = sum(perc_first_round))## # A tibble: 19 × 2
## year perc_first_round_tot
## <dbl> <dbl>
## 1 1990 0.395
## 2 1991 0.351
## 3 1992 0.326
## 4 1993 0.332
## 5 1994 0.306
## 6 1995 0.285
## 7 1996 0.264
## 8 1997 0.236
## 9 1998 0.233
## 10 1999 0.220
## 11 2000 0.217
## 12 2001 0.195
## 13 2002 0.184
## 14 2003 0.196
## 15 2004 0.179
## 16 2005 0.166
## 17 2006 0.151
## 18 2007 0.156
## 19 2008 0.139
plot_df_1 %>%
filter(country == "United States") %>%
group_by(year) %>%
summarise(perc_first_round_tot = sum(perc_first_round))## # A tibble: 19 × 2
## year perc_first_round_tot
## <dbl> <dbl>
## 1 1990 0.253
## 2 1991 0.226
## 3 1992 0.218
## 4 1993 0.216
## 5 1994 0.192
## 6 1995 0.183
## 7 1996 0.151
## 8 1997 0.140
## 9 1998 0.134
## 10 1999 0.133
## 11 2000 0.128
## 12 2001 0.118
## 13 2002 0.122
## 14 2003 0.132
## 15 2004 0.120
## 16 2005 0.109
## 17 2006 0.100
## 18 2007 0.0994
## 19 2008 0.0907
# get players for annotation
get_players <- function(country_str, year_int){
gs_first_round_gdp %>%
filter(country == country_str & year == year_int) %>%
distinct(name)
}
get_players("United States", 1990)## # A tibble: 106 × 1
## name
## <chr>
## 1 Jim Pugh
## 2 Ivan Lendl
## 3 Tim Wilkison
## 4 Todd Witsken
## 5 Glenn Layendecker
## 6 Jimmy Brown
## 7 John McEnroe
## 8 Dan Goldie
## 9 Leif Shiras
## 10 Richey Reneberg
## # … with 96 more rows
get_players("United States", 2008)## # A tibble: 41 × 1
## name
## <chr>
## 1 Sam Querrey
## 2 Vincent Spadea
## 3 Donald Young
## 4 Robby Ginepri
## 5 Scoville Jenkins
## 6 Mardy Fish
## 7 James Blake
## 8 Bobby Reynolds
## 9 Wayne Odesnik
## 10 John Isner
## # … with 31 more rows
upward trend
plot_df <- grid %>%
left_join(gs_entries_top_20_perc) %>%
mutate(perc_first_round = replace_na(perc_first_round, 0)) %>%
filter(year > 2008)
change_df <- plot_df %>%
filter(year == max(year) | year == min(year)) %>%
select(year, country, perc_first_round) %>%
pivot_wider(names_from = year,
values_from = perc_first_round,
values_fill = 0) %>%
mutate(change = `2019` - `2009`,
fall = change < 0,
change_bin = cut(change, breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>%
arrange(change)
countries_of_int <- c("United States", "France", "Sweden", "Australia", "Spain")
plot_df_1 <- plot_df %>%
left_join(change_df) %>%
mutate(country = if_else(country %in% countries_of_int, country, "Other")) %>%
group_by(country, year) %>%
summarise(perc_first_round = sum(perc_first_round)) %>%
ungroup()
facet_order <- c("United States", "Australia", "Sweden", "France", "Spain", "Other")
# annotation df
annotation_df <- plot_df_1 %>%
mutate(label = round(perc_first_round * 100, 1),
num_appearances = round(perc_first_round * 256)) %>%
filter(year == max(plot_df_1$year)|
year == min(plot_df_1$year))
p <- ggplot(plot_df_1,
aes(year, perc_first_round)) +
geom_area(aes(group = country),
colour = "grey80", size = 0.2) +
ggrepel::geom_text_repel(data = annotation_df,
mapping = aes(label = num_appearances)) +
facet_wrap(~factor(country, levels = facet_order))
pcountry affects
gs_first_round_gdp %>%
count(year, country, iso) %>%
group_by(year) %>%
mutate(perc_appear = n / sum(n)) %>%
filter(iso %in% c("USA", "FRA", "ESP")) %>%
ggplot() +
geom_line(aes(year, perc_appear, colour = country))country_group_counts <- gs_first_round_gdp %>%
count(year, country, iso) %>%
mutate(colour = if_else(
iso %in% c("USA", "FRA", "ESP"), iso, "other"
))
country_group_counts %>%
ggplot() +
geom_line(aes(year, n, group = country, colour = colour)) +
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) ggsave("../images/image_7.svg", units = "mm", width = 525, height = 350)
country_group_counts %>%
filter(iso == "USA")## # A tibble: 30 × 5
## year country iso n colour
## <dbl> <chr> <chr> <int> <chr>
## 1 1990 United States USA 262 USA
## 2 1991 United States USA 232 USA
## 3 1992 United States USA 224 USA
## 4 1993 United States USA 222 USA
## 5 1994 United States USA 198 USA
## 6 1995 United States USA 188 USA
## 7 1996 United States USA 156 USA
## 8 1997 United States USA 144 USA
## 9 1998 United States USA 139 USA
## 10 1999 United States USA 138 USA
## # … with 20 more rows
top_3 <- gs_first_round_gdp %>%
count(year, country) %>%
arrange(year, desc(n)) %>%
group_by(year) %>%
mutate(rank = rank(-n)) %>%
ungroup() %>%
filter(rank == 1 | rank == 2 | rank == 3)Countries outside top twenty percent
gs_entries_the_other_80 <- gs_entries_by_country %>%
filter(!top_20_perc) %>%
count(year, country, wt = num_first_rd)
grid <- expand_grid(country = unique(gs_entries_the_other_80$country),
year = unique(gs_entries_the_other_80$year))
plot_df <- grid %>%
left_join(gs_entries_the_other_80) %>%
mutate(n = replace_na(n, 0))
p <- ggplot(plot_df,
aes(year, n, group = country)) +
geom_line()
pplotly::ggplotly(p)p <- ggplot(plot_df,
aes(year, n, fill = country)) +
geom_area()
pplotly::ggplotly(p)gs_entries_by_country %>%
filter(str_detect(str_to_lower(country), "cz"))## # A tibble: 120 × 8
## year tourney_name country_code country gdp_per_…¹ num_f…² incom…³ top_2…⁴
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 1990 Australian Open CZE Czechia 23585. 10 8 FALSE
## 2 1990 Roland Garros CZE Czechia 23585. 11 8 FALSE
## 3 1990 US Open CZE Czechia 23585. 10 8 FALSE
## 4 1990 Wimbledon CZE Czechia 23585. 10 8 FALSE
## 5 1991 Australian Open CZE Czechia 20896. 11 8 FALSE
## 6 1991 Roland Garros CZE Czechia 20896. 10 8 FALSE
## 7 1991 US Open CZE Czechia 20896. 8 8 FALSE
## 8 1991 Wimbledon CZE Czechia 20896. 9 8 FALSE
## 9 1992 Australian Open CZE Czechia 20769. 9 8 FALSE
## 10 1992 Roland Garros CZE Czechia 20769. 7 8 FALSE
## # … with 110 more rows, and abbreviated variable names ¹gdp_per_capita,
## # ²num_first_rd, ³income_decile, ⁴top_20_perc
countries_of_int <- c("Russia", "Argentina", "Czechia")
plot_df %>%
mutate(colour = if_else(country %in% countries_of_int,
country, "other")) %>%
filter(!(country == "Czechia" & year >= 2017)) %>%
ggplot(aes(year, n,
colour = colour,
group = country)) +
geom_line() +
scale_colour_manual(values = c("#A7BCD6", "#35469D", "grey95", "#C94A54")) +
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) ggsave("../images/image_8.svg", units = "mm", width = 525, height = 350)Bottom 50 percent
bottom_50_perc <- gs_entries_by_country %>%
filter(income_decile <= 5)
# num countries with first round appearances
bottom_50_perc %>%
distinct(year, country) %>%
count(year) %>%
ggplot() +
geom_col(aes(year, n), width = 0.8) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) ggsave("../images/image_9.svg", units = "mm", width = 525, height = 350)
# for annotation
bottom_50_perc %>%
distinct(year, country) %>%
filter(year == 1990 | year == 2013)## # A tibble: 7 × 2
## year country
## <dbl> <chr>
## 1 1990 India
## 2 1990 Nigeria
## 3 1990 Peru
## 4 2013 China
## 5 2013 Uzbekistan
## 6 2013 India
## 7 2013 Georgia
# countries from bottom 50 percent with most appearances in first round 1990 - 2019
top_n <- 5
top_n_countries <- bottom_50_perc %>%
count(country) %>%
slice_max(order_by = n, n = top_n) %>%
.$country
bottom_50_perc %>%
distinct(year,country) %>%
filter(year == max(year)) %>%
.$country## [1] "Ukraine" "South Africa" "Tunisia" "India" "Uzbekistan"
## [6] "Moldova" "Bolivia"
plot_df <- bottom_50_perc %>%
count(year, country) %>%
mutate(country = if_else(country %in% top_n_countries,
country,
"Other")) %>%
group_by(year, country) %>%
summarise(n = sum(n))
ggplot(plot_df) +
geom_col(aes(year, n, fill = country))# for story text
bottom_50_perc %>%
distinct(year, country) %>%
filter(year == 2003)## # A tibble: 17 × 2
## year country
## <dbl> <chr>
## 1 2003 Belarus
## 2 2003 Morocco
## 3 2003 Indonesia
## 4 2003 Philippines
## 5 2003 Peru
## 6 2003 Armenia
## 7 2003 Ecuador
## 8 2003 Georgia
## 9 2003 Paraguay
## 10 2003 Uzbekistan
## 11 2003 Zimbabwe
## 12 2003 Colombia
## 13 2003 Madagascar
## 14 2003 Ukraine
## 15 2003 Bosnia and Herzegovina
## 16 2003 China
## 17 2003 Tunisia
bottom_50_perc_country_counts <- bottom_50_perc %>%
count(year, country, wt = num_first_rd) %>%
group_by(country) %>%
mutate(country_ave_n = mean(n, na.rm = TRUE)) %>%
ungroup()
# look at how many times the countries appear in the bottom fifty percent
bottom_50_summary <- bottom_50_perc_country_counts %>%
count(country) %>%
rename(total_n = n) %>%
left_join(distinct(
bottom_50_perc_country_counts,
country,
country_ave_n
)) %>%
arrange(desc(country_ave_n))
# bottom_50_perc_country_counts <- bottom_50_perc_country_counts %>%
# left_join(bottom_50_summary)
# create a grid to see where countries have moved out of bottom 50 perc
grid <- expand_grid(year = unique(bottom_50_perc_country_counts$year),
country = unique(bottom_50_perc_country_counts$country))
plot_df <- grid %>%
left_join(bottom_50_perc_country_counts)
# deciles by year
top_50_countries_by_year <- gs_entries_by_country %>%
distinct(year, country, income_decile) %>%
filter(income_decile > 5)
check_top_50 <- function(year, country){
selector <- top_50_countries_by_year$year == year &
top_50_countries_by_year$country == country
res <- top_50_countries_by_year[selector, ]
if(nrow(res) == 0){
return(FALSE)
}
else {
return(res[[1,"income_decile"]] > 5)
}
}
check_top_50(1991, "United States")## [1] TRUE
plot_df_1 <- plot_df %>%
mutate(in_top_50 = map2_lgl(year, country, ~check_top_50(.x,.y)),
n = if_else(is.na(n) & in_top_50,
-1, n),
n = replace_na(n, 0),
bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 5, 10, 20, Inf)))
# for ordering exploratory plot
levels <- rev(bottom_50_summary$country)
ggplot(plot_df_1) +
geom_tile(aes(x = year,
y = factor(country, levels = levels),
fill = bin_n),
colour = "#E7E4E5") +
labs(x= NULL, y = NULL) +
coord_equal() +
# https://gka.github.io/palettes/#/5|s|ffffff,35469d|ffffe0,ff005e,93003a|1|1
scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', '#9f9dce', '#6e70b6', '#35469d')) +
#scale_fill_gradient(low = "#D6DCE0", high = "#000DA8") +
theme_minimal() +
theme()ggsave("test_out.svg")library(ggridges)
bottom_50_countries <- bottom_50_summary %>%
slice_head(n = 10) %>%
.$country
gs_first_round_gdp %>%
filter(country %in% bottom_50_countries) %>%
ggplot() +
ggridges::geom_density_ridges(mapping = aes(year,
factor(country, levels = levels),
height = stat(density)),
stat = "density")